WB Project PDO text analysis

Author

Luisa M. Mimmi

Published

September 25, 2024

Work in progress

Set up

# Packages -------------------------------------
library(fs) # Cross-Platform File System Operations Based on 'libuv'
library(tidyverse) # Easily Install and Load the 'Tidyverse'
library(janitor) # Simple Tools for Examining and Cleaning Dirty Data
library(skimr) # Compact and Flexible Summaries of Data
library(here) # A Simpler Way to Find Your Files
library(paint) # paint data.frames summaries in colour
library(readxl) # Read Excel Files
library(tidytext) # Text Mining using 'dplyr', 'ggplot2', and Other Tidy Tools
library(SnowballC) # Snowball Stemmers Based on the C 'libstemmer' UTF-8 Library
library(rsample) # General Resampling Infrastructure
library(rvest) # Easily Harvest (Scrape) Web Pages
library(cleanNLP) # A Tidy Data Model for Natural Language Processing
library(kableExtra) # Construct Complex Table with 'kable' and Pipe Syntax

— Note on cleanNLP package

cleanNLP supports multiple backends for processing text, such as CoreNLP, spaCy, udpipe, and stanza. Each of these backends has different capabilities and might require different initialization procedures.

  • CoreNLP ~ powerful Java-based NLP toolkit developed by Stanford, which includes many linguistic tools like tokenization, part-of-speech tagging, and named entity recognition.
    • ❕❗️ NEEDS EXTERNAL INSTALLATION (must be installed in Java with cnlp_install_corenlp(), which installs the Java JAR files and models)
  • spaCy ~ fast and modern NLP library written in Python. It provides advanced features like dependency parsing, named entity recognition, and tokenization.
    • ❕❗️ NEEDS EXTERNAL INSTALLATION (must be installed in Python with spacy_install(), which installs both spaCy and the necessary Python dependencies; the spacyr R package must also be installed to interface with it)
  • udpipe ~ R package that provides bindings to the UDPipe NLP toolkit. Fast, lightweight and language-agnostic NLP library for tokenization, part-of-speech tagging, lemmatization, and dependency parsing.
  • stanza ~ another modern NLP library from Stanford, similar to CoreNLP but built on PyTorch, supporting over 66 languages…

When you initialize a backend (like CoreNLP) in cleanNLP, it stays active for the entire session unless you reinitialize it or explicitly switch to another one.

# ---- 1) Initialize the CoreNLP backend
library(cleanNLP)
cnlp_init_corenlp()
# If you want to specify a language or model path:
cnlp_init_corenlp(language = "en", 
                  # model_path = "/path/to/corenlp-models"
                  )

# ---- 2) Initialize the spaCy backend 
library(cleanNLP)
library(spacyr)
# Initialize spaCy in cleanNLP
cnlp_init_spacy()
# Optional: specify language model
cnlp_init_spacy(model_name = "en_core_web_sm")

# ---- 3) Initialize the udpipe backend
library(cleanNLP)
# Initialize udpipe backend
cnlp_init_udpipe(model_name = "english")

# ---- 4) Initialize the stanza backend

—————————————————————————-

Data sources

WB Projects & Operations

World Bank Projects & Operations can be explored at:

  1. Data Catalog. From which
  2. Advanced Search

—————————————————————————

Load pre-processed Projs’ PDO dataset pdo_train_t

[Saved file projs_train_t ]

Done in analysis/_01a_WB_project_pdo_prep.qmd

  1. I manually retrieved ALL WB projects approved between FY 1947 and 2026 as of 31/08/2024, simply using the Excel button on this page: WBG Projects
    • By the way, this is the link: “list-download-excel”
    • then saved the huge .xls file in data/raw_data/project2/all_projects_as_of29ago2024.xls
      • (plus an Rdata copy of the original file)
  2. Split the dataset and kept only projs_train (50% of projects with PDO text, i.e. 4413 PDOs)
  3. Cleaned the dataset and saved projs_train_t (cleaned train dataset)
  4. Obtained PoS tagging + tokenization with the cleanNLP package (functions cnlp_init_udpipe() + cnlp_annotate()) and saved projs_train_t (cleaned train dataset).
# Load clean Proj PDO train dataset `pdo_train_t`
pdo_train_t <- readRDS(here::here("data" , "derived_data", "pdo_train_t.rds"))

Explain Tokenization and PoS Tagging

i) Tokenization

A “word” is the most abstract notion, a “type” is a concrete form used in actual language, and a “token” is one particular occurrence of that form in a text (compare an abstract category, ‘wizards’, with an individual instance, ‘Harry Potter’). Breaking a piece of text into words is thus called “tokenization”, and it can be done in many ways.
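As a minimal sketch of the type/token distinction, the toy sentence below (invented for illustration, not taken from the PDO data) is split with tidytext::unnest_tokens(). Each resulting row is a token, while the distinct values are the types. The object names are hypothetical.

```r
library(dplyr)
library(tidytext)

# Toy text, invented for illustration (not from the PDO data)
toy <- tibble(doc_id = 1,
              text = "Improve access to water and improve water quality")

# One row per token; unnest_tokens() lower-cases and strips punctuation by default
toy_tokens <- toy %>%
  unnest_tokens(output = token, input = text)

n_tokens <- nrow(toy_tokens)              # every occurrence counts
n_types  <- n_distinct(toy_tokens$token)  # distinct word forms only
```

Here "improve" and "water" each occur twice, so the sentence has more tokens than types.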

The choices of tokenization

  1. Should words be lower cased?
  2. Should punctuation be removed?
  3. Should numbers be replaced by some placeholder?
  4. Should words be stemmed or lemmatized (i.e. reduced to a root or dictionary form)? ☑️
  5. Should bigrams/multi-word phrases be used instead of single words?
  6. Should stopwords (the most common words) be removed? ☑️
  7. Should rare words be removed?
  8. Should hyphenated words be split into two words? ❌

For the moment I keep everything as conservative as possible.
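Some of these choices can be tried side by side on a toy sentence. The sketch below assumes tidytext::unnest_tokens() as the tokenizer (the PDO data itself was tokenized with cleanNLP/udpipe) and uses the stop_words table shipped with tidytext. The sentence and object names are invented for illustration.

```r
library(dplyr)
library(tidytext)

# Toy sentence, invented for illustration
toy <- tibble(doc_id = 1,
              text = "The Project will improve Water-Supply services in 3 cities.")

# Default behaviour: lower-cased, punctuation stripped
tok_default <- toy %>%
  unnest_tokens(token, text)

# Conservative variant: keep the original case and the punctuation marks
tok_keep <- toy %>%
  unnest_tokens(token, text, to_lower = FALSE, strip_punct = FALSE)

# Stopword removal on top of the default tokens
tok_nostop <- tok_default %>%
  anti_join(stop_words, by = c("token" = "word"))
```

Comparing the three token tables shows how much each choice changes what is later counted.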

ii) PoS Tagging

Classifying tokens as noun, verb, adjective, etc. can help discover intent or action in a sentence, or scan for “verb-noun” patterns. Here I have a training dataset file with:

| Variable | Type | Provenance | Description |
|---|---|---|---|
| proj_id | chr | original PDO data | |
| pdo | chr | original PDO data | |
| word_original | chr | original PDO data | |
| sid | int | output cleanNLP | sentence ID |
| tid | chr | output cleanNLP | token ID within sentence |
| token | chr | output cleanNLP | Tokenized form of the token |
| token_with_ws | chr | output cleanNLP | Token with trailing whitespace |
| lemma | chr | output cleanNLP | The base form of the token |
| upos | chr | output cleanNLP | Universal part-of-speech tag (e.g., NOUN, VERB, ADJ) |
| xpos | chr | output cleanNLP | Language-specific part-of-speech tags |
| feats | chr | output cleanNLP | Morphological features of the token |
| tid_source | chr | output cleanNLP | Token ID in the source document |
| relation | chr | output cleanNLP | Dependency relation between the token and its head token |
| pr_name | chr | output cleanNLP | Name of the parent token |
| FY_appr | dbl | original PDO data | |
| FY_clos | dbl | original PDO data | |
| status | chr | original PDO data | |
| regionname | chr | original PDO data | |
| countryname | chr | original PDO data | |
| sector1 | chr | original PDO data | |
| theme1 | chr | original PDO data | |
| lendinginstr | chr | original PDO data | |
| env_cat | chr | original PDO data | |
| ESrisk | chr | original PDO data | |
| curr_total_commitment | dbl | original PDO data | |

— PoS Tagging: upos (Universal Part-of-Speech)

| upos | n | proportion | explanation |
|---|---|---|---|
| ADJ | 21852 | 0.0853714 | Adjective |
| ADP | 27848 | 0.1087965 | Adposition |
| ADV | 3010 | 0.0117595 | Adverb |
| AUX | 3738 | 0.0146036 | Auxiliary |
| CCONJ | 14486 | 0.0565939 | Coordinating conjunction |
| DET | 22121 | 0.0864223 | Determiner |
| INTJ | 81 | 0.0003165 | Interjection |
| NOUN | 72668 | 0.2838993 | Noun |
| NUM | 2285 | 0.0089270 | Numeral |
| PART | 8846 | 0.0345595 | Particle |
| PRON | 2351 | 0.0091849 | Pronoun |
| PROPN | 14860 | 0.0580550 | Proper noun |
| PUNCT | 29442 | 0.1150240 | Punctuation |
| SCONJ | 2219 | 0.0086692 | Subordinating conjunction |
| SYM | 348 | 0.0013596 | Symbol |
| VERB | 26397 | 0.1031278 | Verb |
| X | 3412 | 0.0133300 | Other |
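A tally of this kind can be obtained by counting rows per upos tag and dividing by the total. The sketch below uses a toy annotation table invented for illustration. The real table may well have been produced differently (for instance with janitor), so this is only one possible way to get the same shape.

```r
library(dplyr)

# Toy annotation output, invented for illustration: one row per token
toy_anno <- tibble(
  token = c("improve", "access", "to", "clean", "water", "."),
  upos  = c("VERB", "NOUN", "ADP", "ADJ", "NOUN", "PUNCT")
)

upos_freq <- toy_anno %>%
  count(upos) %>%                          # tokens per tag
  mutate(proportion = n / sum(n)) %>%      # share of all tokens
  arrange(desc(n))
```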

iii) Make lower case

pdo_train_t <- pdo_train_t %>%
  # Lower-cased version of each token
  mutate(token_l = tolower(token)) %>%
  relocate(token_l, .after = token) %>%
  select(-token_with_ws) %>%
  # Replace variations of "hyphenword" with "-"
  mutate(
    lemma = str_replace_all(lemma, regex("hyphenword|hyphenwor", ignore_case = TRUE), "-")
  ) %>%
  # Stem the lower-cased token (SnowballC::wordStem)
  mutate(stem = wordStem(token_l)) %>%
  relocate(stem, .after = lemma)

iv) Stemming
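The stem column is built with SnowballC::wordStem(), which strips suffixes by rule rather than looking words up in a dictionary. A small sketch on invented tokens (not the PDO data) shows how several inflected forms collapse onto one stem. The object names are hypothetical.

```r
library(dplyr)
library(SnowballC)

# Toy lower-cased tokens, invented for illustration
toy_tokens <- tibble(
  token_l = c("development", "developing", "develop", "objective", "objectives")
)

toy_stems <- toy_tokens %>%
  mutate(stem = wordStem(token_l, language = "porter"))

n_forms <- n_distinct(toy_stems$token_l)  # distinct word forms before stemming
n_stems <- n_distinct(toy_stems$stem)     # distinct stems after stemming
```

Because the counts of all forms are pooled under their stem, stems tend to be fewer and more frequent than the tokens they come from.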

_______

TEXT ANALYSIS/SUMMARY

_______

NOTE: Among the words / stems encountered in PDOs, there are many acronyms, which may refer to World Bank lingo, local agencies, etc. Especially when seen in lower-case form, they don’t make much sense…

see https://cengel.github.io/R-text-analysis/textanalysis.html

Frequencies of documents/words/stems

# Count words
counts_pdo <- pdo_train_t %>%
     count(pdo, sort = TRUE)  # 4,071

counts_words <- pdo_train_t %>%
     count(word_original, sort = TRUE)  # 13,441

counts_token <- pdo_train_t %>%
  count(token, sort = TRUE)   # 13,420

counts_lemma <- pdo_train_t %>%
  count(lemma, sort = TRUE)   # 11,705

counts_stem <- pdo_train_t %>%
  count(stem, sort = TRUE)   # 8,812

We are looking at pdo_train_t which has 134,858 rows and 7 columns.

  • PDOs = 4071 in projects
    • ranging from 2001 to 2023
| Measure | Count |
|---|---|
| N proj | 4413 |
| N PDOs | 4071 |
| N words | 13231 |
| N token | 11399 |
| N lemma | 11474 |
| N stem | 8812 |


[FUNC] save plots

f_save_plot <- function(plot_name, plot_object) {
  # Print the plot, then save it as PDF and PNG
  # (plain calls instead of %T>%, which needs magrittr to be attached)
  print(plot_object)
  ggsave(plot = plot_object,
         filename = here("analysis", "output", "figures", paste0(plot_name, ".pdf")),
         # width = 4, height = 2.25, units = "in",
         device = cairo_pdf)
  ggsave(plot = plot_object,
         filename = here("analysis", "output", "figures", paste0(plot_name, ".png")),
         # width = 4, height = 2.25, units = "in",
         type = "cairo", dpi = 300)
}

# Example of using the function
# f_save_plot("proj_wrd_freq", proj_wrd_freq)

[FIG] Overall token freq ggplot

  • Without “project”, “development”, “objective”
# Evaluate the title with glue first
title_text <- glue::glue("Most frequent token in {n_distinct(pdo_train_t$proj_id)} PDOs from projects approved between FY {min(pdo_train_t$boardapprovalFY)} and {max(pdo_train_t$boardapprovalFY)}")

proj_wrd_freq <- pdo_train_t %>%   # 256,632
  # drop function words, symbols and punctuation
  filter(!(upos %in% c("AUX", "CCONJ", "INTJ", "DET", "PART", "ADP", "SCONJ", "SYM", "PUNCT"))) %>%
  filter(!(relation %in% c("nummod"))) %>% # 173,686
  # drop generic PDO terms, list markers, and "is" when it is tagged as VERB
  filter(!(token_l %in% c("pdo", "project", "development", "objective", "objectives",
                          "i", "ii", "iii", "is"))) %>%
  count(token_l) %>%
  filter(n > 800) %>%
  mutate(token_l = reorder(token_l, n)) %>%  # reorder values by frequency
  # plot
  ggplot(aes(token_l, n)) +
  geom_col(fill = "gray") +
  coord_flip() + # flip x and y coordinates so we can read the words better
  labs(title = title_text,
       subtitle = "[token_l count > 800]", y = "", x = "")

proj_wrd_freq

f_save_plot("proj_wrd_freq", proj_wrd_freq)

[FIG] Overall stem freq ggplot

  • Without “project”, “develop”, “object”
# Evaluate the title with glue first
title_text <- glue::glue("Most frequent STEM in {n_distinct(pdo_train_t$proj_id)} PDOs from projects approved between FY {min(pdo_train_t$boardapprovalFY)} and {max(pdo_train_t$boardapprovalFY)}")

proj_stem_freq <- pdo_train_t %>%   # 256,632
  # drop function words, symbols and punctuation
  filter(!(upos %in% c("AUX", "CCONJ", "INTJ", "DET", "PART", "ADP", "SCONJ", "SYM", "PUNCT"))) %>%
  filter(!(relation %in% c("nummod"))) %>% # 173,686
  # drop generic PDO stems and list markers
  filter(!(stem %in% c("pdo", "project", "develop", "object", "i", "ii", "iii"))) %>%
  count(stem) %>%
  filter(n > 800) %>%
  mutate(stem = reorder(stem, n)) %>%  # reorder values by frequency
  # plot
  ggplot(aes(stem, n)) +
  geom_col(fill = "gray") +
  coord_flip() + # flip x and y coordinates so we can read the words better
  labs(title = title_text,
       subtitle = "[stem count > 800]", y = "", x = "")

proj_stem_freq

f_save_plot("proj_stem_freq", proj_stem_freq)

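Evidently, after stemming, more words (or stems) reach the threshold frequency count of 800, because the counts of inflected forms (e.g. “objective” and “objectives”) are pooled under one stem.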

_______

>>>>>> HERE <<<<<<<<<<<<<<<<<<

Main refs:

  • https://www.nlpdemystified.org/course/advanced-preprocessing
  • review what I had done to clean the text in analysis//03_WDR_pdotracs_explor.qmd
  • https://cengel.github.io/R-text-analysis/textprep.html#detecting-patterns
  • https://guides.library.upenn.edu/penntdm/r
  • https://smltar.com/stemming#how-to-stem-text-in-r (BOOK, STEMMING)

_______

Isolate SECTOR words and see frequency over years

To make the analysis a bit more meaningful, we can look at the frequency of the most common words related to SECTORs.

I also create a “broad sector” variable that groups sector-related words under broader definitions:

  • water = water, wastewater, sanitation
  • transport = transport, railway, road, airport, port
  • urban = urban
  • energy = energy, electricity, hydroelectric, hydropower, renewable, transmission
  • health = health, hospital, medicine, drugs, epidemic, pandemic, covid-19, vaccine
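A grouping like this can be coded as pattern matching on the lower-cased tokens. The sketch below uses invented tokens (not PDO data) to illustrate one thing worth checking: a plain substring pattern such as "road" or "port" also matches unrelated words, while an anchored pattern or an exact comparison does not. The column names naive and anchored are illustrative only.

```r
library(dplyr)
library(stringr)

# Toy tokens, invented for illustration
toy <- tibble(
  token_l = c("water", "watershed", "road", "roads", "broad", "port", "support")
)

toy_checked <- toy %>%
  mutate(
    # substring match anywhere in the token
    naive    = str_detect(token_l, "road|port"),
    # "road" only at the start of the token, "port" only as an exact match
    anchored = str_detect(token_l, "^road") | token_l == "port"
  )

# Tokens caught only by the naive pattern
false_hits <- toy_checked %>%
  filter(naive, !anchored) %>%
  pull(token_l)
```

Listing false_hits on the real tokens would be a quick way to see how much each pattern over-matches before counting.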

… P123322

pdo_train_t <- pdo_train_t %>%
  # dealing with water/watershed/waterway
  mutate(token_l_broad = case_when(
    str_detect(token_l, "water|wastewater|sanitat") ~ "water",
    # NB: the substring "road" also matches "broad" and "abroad"
    str_detect(token_l, "transport|railway|road|airport") ~ "transport",
    token_l == "port" ~ "transport",  # exact match, so "support" etc. are not caught
    str_detect(token_l, "urban") ~ "urban",
    # energy, electric*, hydroelectric, hydropower, renewable, transmission
    str_detect(token_l, "energ|electri|hydroele|hydropow|renewable|transmis") ~ "energy",
    str_detect(token_l, "health|hospital|medicine|drugs|epidem|pandem|covid-19|vaccin") ~ "health",
    TRUE ~ NA_character_)) %>%
  relocate(token_l_broad, .after = token_l) # move the new column to the right of token_l
# prepare data for plotting
sector_broad <- pdo_train_t %>%
  filter(!is.na(token_l_broad)) %>%
  count(FY_appr, token_l_broad) %>%
  filter(n > 0) %>%
  # fix the order in which the broad sectors appear in the facets
  mutate(token_l_broad = factor(token_l_broad, levels = c("water", "transport", "urban", "energy", "health")))

# plot
proj_sect_broad_fr <- ggplot(data = sector_broad,
                             aes(x = FY_appr, y = n,
                                 group = token_l_broad, color = token_l_broad)) +
  geom_line() +
  geom_point() +
  scale_x_continuous(breaks = seq(2001, 2023, by = 2)) +
  scale_color_viridis_d(option = "magma", end = 0.9) +
  facet_wrap(~token_l_broad, ncol = 2, scales = "free") +
  guides(color = "none") +  # `guides(color = FALSE)` is deprecated
  theme_bw() +
  theme(# Adjust angle and alignment of x labels
    axis.text.x = element_text(angle = 45, hjust = 1)) +
  labs(title = "Sector words frequency in PDO over Fiscal Years",
       subtitle = "[Using \"custom\" broad sector definition]",
       x = "Board approval FY", y = "Counts of 'sector' word (token_l_broad)") +
  # mark the Covid year in the "health" panel only
  geom_vline(data = subset(sector_broad, token_l_broad == "health"), aes(xintercept = 2020),
             linetype = "dashed", color = "#9b6723") +
  geom_text(data = subset(sector_broad, token_l_broad == "health"),
            aes(x = 2020, y = max(sector_broad$n) * 0.75, label = "Covid"),
            angle = 90, vjust = -0.5, color = "#9b6723")


proj_sect_broad_fr

f_save_plot("proj_sect_broad_fr", proj_sect_broad_fr)

Isolate INSTITUTIONAL words and see frequency over years

df <- pdo_train_t %>%
  # pool "sme" and "msme" under one stem
  mutate(stem = if_else(stem == "sme" | stem == "msme", "sme-msme", stem)) %>%
  filter(stem %in% c("public", "privat", "govern", "ngo", "enterpris", "sme-msme")) %>%
  mutate(FY = boardapprovalFY) %>%
  count(FY, stem)

proj_inst_stem_fr <- ggplot(data = df, aes(x = FY, y = n, group = stem, color = stem)) +
  geom_line() +
  geom_point() +
  scale_x_continuous(breaks = seq(2001, 2023, by = 2)) +
  scale_color_viridis_d(option = "magma", end = 0.9) +
  # Reorders the stem variable based on the total count (n), with .desc = TRUE to order from highest to lowest
  facet_wrap(~ fct_reorder(stem, n, .fun = sum, .desc = TRUE), ncol = 2, scales = "free_y") +
  guides(color = "none") +  # `guides(color = FALSE)` is deprecated
  theme_bw() +
  theme(# Adjust angle and alignment of x labels
    axis.text.x = element_text(angle = 45, hjust = 1)) +
  labs(title = "Institutional words frequency in PDO over Fiscal Years",
       x = "Board approval FY", y = "Counts of 'institutional' word (stem)")


proj_inst_stem_fr
f_save_plot("proj_inst_stem_fr", proj_inst_stem_fr)

Term frequency

Word and document frequency: Tf-idf

The goal is to quantify what a document is about.

  • term frequency (tf) = how frequently a word occurs in a document… but there are words that occur many times and are not important
  • inverse document frequency (idf) = decreases the weight of commonly used words and increases the weight of words that are rarely used in a collection of documents
  • tf-idf = the frequency of a term adjusted for how rarely it is used across the collection; an alternative to removing stopwords. [It measures how important a word is to a document in a collection (or corpus) of documents, but it is still a rule-of-thumb or heuristic quantity]

The tf-idf is the product of the term frequency and the inverse document frequency.
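As a sketch of how this could be computed, tidytext::bind_tf_idf() takes a table with one row per term per document and adds tf, idf and tf_idf columns. In that function, idf is the natural log of the number of documents divided by the number of documents containing the term. The project IDs and counts below are invented for illustration.

```r
library(dplyr)
library(tidytext)

# Toy term counts per document, invented for illustration
toy_counts <- tibble(
  proj_id = c("P1", "P1", "P2", "P2"),
  token_l = c("improve", "water", "improve", "road"),
  n       = c(2, 2, 1, 3)
)

toy_tf_idf <- toy_counts %>%
  bind_tf_idf(term = token_l, document = proj_id, n = n) %>%
  arrange(desc(tf_idf))
```

"improve" appears in both toy documents, so its idf and tf-idf are zero, while "water" and "road" each appear in only one document and get a positive weight.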

N-Grams
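A minimal sketch of what bigram extraction could look like, assuming tidytext::unnest_tokens() with token = "ngrams" is applied to the raw PDO text. The sentence and project ID below are invented for illustration and the object names are hypothetical.

```r
library(dplyr)
library(tidytext)

# Toy PDO text, invented for illustration
toy <- tibble(proj_id = "P1",
              pdo = "improve access to safe water supply")

# One row per pair of consecutive words
toy_bigrams <- toy %>%
  unnest_tokens(output = bigram, input = pdo, token = "ngrams", n = 2)

bigram_counts <- toy_bigrams %>%
  count(bigram, sort = TRUE)
```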

Co-occurrence

_______

>>>>>> NEXT <<<<<<<<<<<<<<<<<<

_______

Named Entity Recognition using CleanNLP and spaCy

NER is especially useful for analyzing unstructured text.

— Summarise the tokens by parts of speech

# Initialize the spacy backend
cnlp_init_spacy() 
quarto render analysis/01b_WB_project_pdo_anal.qmd --to html
open ./docs/analysis/01b_WB_project_pdo_anal.html